In [16]:
from IPython.display import YouTubeVideo, HTML, Math, Image
# how to install anaconda on mac
YouTubeVideo('6Dv1wNvTPbg')
Out[16]:
Then you can copy every cell here, or you can import this document into your Anaconda folder (this part is not shown in the video; I'll try to find another video).
In [17]:
# These modules will help you retrieve the content of the Hindawi APC table
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
In [18]:
hindawi_apc_url = 'http://www.hindawi.com/apc/'
hindawi_html_page = urllib.request.urlopen(hindawi_apc_url)
soup = BeautifulSoup(hindawi_html_page, 'html.parser')
In [19]:
HTML('<iframe src="http://www.hindawi.com/apc/" width="900" height="550"></iframe>')
Out[19]:
This Hindawi page contains one HTML table that we need to parse. Please look at the W3Schools documentation about 'table' if you're not familiar with HTML.
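If the structure is new to you, here is a minimal sketch on a made-up two-row table (not the Hindawi page) showing how BeautifulSoup exposes the 'tr', 'th' and 'td' elements:
# A made-up mini table, only to illustrate the 'tr' / 'th' / 'td' structure
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Journal Title</th><th>ISSN</th><th>APC</th></tr>
  <tr><td>Some Journal</td><td>1234-5678</td><td>$500</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
for row in sample_soup.find_all('tr'):
    # a row holds either header cells ('th') or data cells ('td')
    print([cell.text for cell in row.find_all(['th', 'td'])])
# ['Journal Title', 'ISSN', 'APC']
# ['Some Journal', '1234-5678', '$500']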
In [20]:
# Find all the 'tr' (table rows) within the page
content = soup.find_all('tr')
In [21]:
'''
<tr class="subscription_table_head">
    <th>Journal Title</th>
    <th>ISSN</th>
    <th class="last_th">APC</th>
</tr>,
<tr class="subscription_table_plus">
    <td>
        <a href="/journals/aaa/">Abstract and Applied Analysis</a>
    </td>
    <td>1687-0409</td>
    <td class="to_right">$800</td>
</tr>
...
'''
0
Out[21]:
Because the first 'tr' contains the table header (Journal Title, ISSN, APC), we will start retrieving content after the first 'tr'.
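To convince yourself that the slice is the right one, you can inspect the first row returned by find_all; a quick check (assuming 'content' was built as above) could look like this:
# content[0] is the header row: it holds 'th' cells, not 'td' cells
print([th.text.strip() for th in content[0].find_all('th')])
# expected: ['Journal Title', 'ISSN', 'APC']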
In [22]:
table = []
# start with the second 'tr'
for value in content[1:]:
    # this will find all the 'td' within this 'tr'
    value = value.find_all('td')
    # indexes of value: 0, 1, 2
    # value ===> 'Abstract and Applied Analysis', '1687-0409', '$800'
    apc = value[2].text.strip()
    # let's remove the '$' sign if any
    if "$" in apc:
        apc = apc.split('$')[1]
        apc = int(apc)
    # if the value is 'Free', then let's write 0 instead of Free
    else:
        apc = 0
    table.append([value[0].text.strip(), value[1].text.strip(), apc])
In [23]:
hindawi_apc_table = pd.DataFrame(table, columns=['Journal Title','ISSN','APC'])
In [24]:
hindawi_apc_table.head(10)
Out[24]:
In [25]:
# Export to Excel
hindawi_apc_table.to_excel('Hindawi_apc_table.xlsx', sheet_name = 'Hindawi_APC_Table', index = False)
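To double-check the export, you can read the file back with pandas (this assumes an Excel engine such as openpyxl is available, which is usually the case with Anaconda):
# Read the exported file back and look at its size and first rows
check = pd.read_excel('Hindawi_apc_table.xlsx', sheet_name='Hindawi_APC_Table')
print(check.shape)
check.head()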
In summary, to scrape the table you will need these steps: